Model comparison (done correctly) helps to choose the model that provides a good representation of the true DGP by penalizing models that “overfit”. This penalization is achieved mainly by assessing “fit” not on a training data set, but on a hold out test data set.
A complementary approach to work against “overfitting” is to specify priors that shrink model coefficients towards zero. Such shrinkage priors are typically normally distributed, have a mean of zero and a relatively small standard deviation. Here relative refers to the scale on which a predictor is measured.
To show how shrinkage works, we estimate spline models with different standard deviations on regression coefficients for the simulated income / well being data above.
The following figure shows the estimated relationships for different samples drawn from the population.
Hopefully you can see that large deviations between the true DGP in red and the estimated DGP in blue are less frequent when the prior on regression coefficients is narrow (top left) compared to when it is wider (bottom right).
To confirm this, the following plot shows deviances from a simulation that - samples 1000 times training and test data, - estimates model parameters on the training set - calculates deviances for the training and test data sets.
The following figure shows deviances +/- 1 sd.
load("sim_lppd.Rdata")
D.test = -2*elpd.test
D.train = -2*elpd.train
m.test = colMeans(D.test)
m.train = colMeans(D.train)
within_sd = function(m) {
apply(apply(m,2, function(x) x - rowMeans(m)), 2, sd)
}
lower.test = m.test - within_sd(elpd.test)
upper.test = m.test + within_sd(elpd.test)
lower.train = m.train - within_sd(elpd.train)
upper.train = m.train + within_sd(elpd.train)
par(mfrow = c(2,1), mar=c(2.5,2.5,.5,.5), mgp=c(1,.1,0), tck=-.01)
ylim = range(c(lower.train, upper.train))
xs = 1:ncol(elpd.test)
plot(xs,m.train, ylim = ylim,
ylab = "Deviance train", xlab = "", xaxt = "n")
axis(1, at = 1:5, labels = c(1,2,5,10,20))
arrows(xs,y0 = lower.train, y1 = upper.train, length = 0)
ylim = range(c(lower.test, upper.test))
plot(xs+.2, m.test, col = "blue", pch = 16, ylim = ylim,
ylab = "Deviance test", xlab = "sd(b)", xaxt = "n")
arrows(xs+.2,y0 = lower.test, y1 = upper.test, length = 0, col = "blue")
axis(1, at = 1:6, labels = b.sds)
#text(xs,m.train, labels = round(m.train,2), pos = 2, cex = .5)
#text(xs,m.test, labels = round(m.test,2), pos = 2, cex = .5, col = "blue")
Indeed, we see that while models with wider priors on the regression coefficients have a lower deviances in the training data, they have larger deviances for the test data.
So far we have implemented some simple cross validation manually, by simply simply splitting our total sample in half. However, data can also be split differently, e.g. 20% training data 80% test data or the other way around.
One popular way to split data to fit the data on \(N-1\) data points and to use the \(N\)th data point as test data. This is referred to as Leave one out cross validation or LOOCV.
One thing that is easily implemented with LOOCV is to calculated the deviance as the average deviance over the \(N\) LOOCV deviances. for LOOCV, there are always only N-1 training data sets.
This is different for alternative schemes. For instance, at a samples size of 20, there are 1.84756^{5} to construct the training data set. Here, averaging about all possible test-data deviances would be too computationally expensive.
The key thing to keep in mind when doing cross validation is that dependent observations (more specificall, observations with correlated errors) should always be kept in either test or training data sets. Such complications plays e.g. a role in time series or hierarchically organized data sets.
The computationally challenging part of LOOCV is that one needs to estimate the model \(N-1\) times.
Fortunately, one can approximate LOOCV with a method called importance sampling. Here, not all \(lppd\) receive the same weight when calculating the deviance, but \(lppd\) values of observations that have a strong influence